In this lab you will learn



library(tidyverse)
library(spatstat) ## for Ripley's K analysis
library(igraph) ## for community-finding analysis
library(ggplotify) ## for function as.ggplot()
library(wordspace) ## for function dist.matrix()
theme_set(theme_bw())
theme_update(panel.grid = element_blank())



All populations exist in space. This may seem like a trivial observation, but it has profound implications for population dynamics. This is true of all life, and particularly so for sessile (i.e. stationary) organisms, such as plants. The spatial location of a plant is crucial for its ability to survive and reproduce, as it will govern its interactions with the environment. Conversely, a plant’s ability to survive in different environments will determine where it can recruit.

The dispersal stage is arguably one of the most consequential life stages of plants. Dispersal, as most spatial processes, is distance-based: seeds are more likely to disperse to some distances from the parent tree than others. This non-randomness in the movement of plants is likely to create non-random emergent patterns at the population level, and ultimately at the community (i.e. multispecies) level.

In this lab, we will examine the spatial distribution of trees in a tropical rainforest. We will look for signs of non-random ecological processes, such as distance-based dispersal and local environmental filters, by comparing the spatial distribution of our species against a null model where trees are randomly located. Our study site is Barro Colorado Island, in the middle of the Panama canal.


Description of the Barro Colorado Island (BCI) data

The 50-hectare plot at Barro Colorado Island, Panama, is a 1000 meter by 500 meter rectangle of forest inside of which all woody trees and shrubs with stems at least 1 cm in stem diameter have been censused. Every individual tree in the 50 hectares was permanently numbered with an aluminum tag in 1982, and every individual has been revisited six times since (in 1985, 1990, 1995, 2000, 2005, and 2010). In each census, every tree was measured, mapped and identified to species. Details of the census method are presented here, and a detailed description of the seven-census results is given here.

Figure: Barro Colorado Island, Panama. Photo credit: Christian Ziegler.


Reading and preparing the data

First, we need to load the data. We will pull it from my GitHub BCI repository, where I collected data files taken from the Smithsonian Tropical Research Institute (STRI) public BCI data repository.

Run the following command (it will take a second because it is a large dataset):

bci = 
  get(
    load(
      url(
        'https://github.com/rafaeldandrea/BCI/blob/master/bci.full7.rdata?raw=true'
      )
    )
  ) |>
  as_tibble()

Now let’s look at the data:

bci
## # A tibble: 394,658 × 20
##    treeID stemID tag    StemTag sp    quadrat     gx    gy Measu…¹ Censu…²   dbh
##     <int>  <int> <chr>  <chr>   <chr> <chr>    <dbl> <dbl>   <int>   <int> <dbl>
##  1      1     NA -05599 <NA>    swar… "4007"  800.   152.       NA      NA    NA
##  2      2     NA -22851 <NA>    hyba… "0718"  152.   379.       NA      NA    NA
##  3      3     NA -24362 <NA>    aegi… "0417"   95.2  358.       NA      NA    NA
##  4      4     NA -26589 <NA>    beil… "0007"   11.7  151.       NA      NA    NA
##  5      5     NA -26590 <NA>    fara… "0004"    7.70  96.2      NA      NA    NA
##  6      6     NA -26703 <NA>    hyba… "0210"   50.1  215.       NA      NA    NA
##  7      7     NA -26746 <NA>    tet2… "0218"   58.5  365.       NA      NA    NA
##  8      8     NA -27125 <NA>    des2… "0613"  140.   267.       NA      NA    NA
##  9      9     NA -27298 <NA>    crot… "0704"  158.    86.4      NA      NA    NA
## 10     10     NA -27784 <NA>    alse… ""       NA     NA        NA      NA    NA
## # … with 394,648 more rows, 9 more variables: pom <chr>, hom <dbl>,
## #   ExactDate <chr>, DFstatus <chr>, codes <chr>, nostems <dbl>, date <dbl>,
## #   status <chr>, agb <dbl[1d]>, and abbreviated variable names ¹​MeasureID,
## #   ²​CensusID


This is a large dataset cataloguing almost 395 thousand trees within the 1,000m-by-500m (50-hectare) Forest Dynamic Plot at Barro Colorado Island, Panama.

Notice the data is essentially a table where each row refers to a tree and each column describes one of the properties of the tree, such as its physical location, size, and species. For our purposes, we are interested in the geographic location, size, and species of each tree. So we will select only those relevant characteristics from the larger dataset by piping the verb select() to the bci dataset and specifying the desired variables:

bci = 
  bci |>
  select(quadrat, gx, gy, sp, dbh)


If we now visualize our data, we see only the five relevant columns:

bci
## # A tibble: 394,658 × 5
##    quadrat     gx    gy sp       dbh
##    <chr>    <dbl> <dbl> <chr>  <dbl>
##  1 "4007"  800.   152.  swars1    NA
##  2 "0718"  152.   379.  hybapr    NA
##  3 "0417"   95.2  358.  aegipa    NA
##  4 "0007"   11.7  151.  beilpe    NA
##  5 "0004"    7.70  96.2 faraoc    NA
##  6 "0210"   50.1  215.  hybapr    NA
##  7 "0218"   58.5  365.  tet2pa    NA
##  8 "0613"  140.   267.  des2pa    NA
##  9 "0704"  158.    86.4 crotbi    NA
## 10 ""       NA     NA   alsebl    NA
## # … with 394,648 more rows


Here, the column quadrat labels which 20m x 20m rectangular section of the 50-ha plot the tree was found, gx and gy give the x and y spatial coordinates of the tree (measured in meters from the bottom left corner of the plot), sp is a 6-letter code giving the species identity, and dbh stands for diameter at breast height (i.e. diameter of the main trunk measured 4.5 ft, or 1.37 m, from the ground).

Notice the NAs in the dbh of some trees. This is because this dataset also lists trees that were dead by the time this census was taken in 2010, but were included for compatibility with the previous censuses. We will exclude dead trees from our analysis. In fact, we will focus on trees meeting a size threshold of dbh > 100 mm, which are generally considered to have recruited past the sapling stage. We achieve this by filtering the data set to those trees whose dbh was greater than 100 mm:

bci = 
  bci |>
  filter(dbh >= 100)

bci
## # A tibble: 20,802 × 5
##    quadrat    gx    gy sp       dbh
##    <chr>   <dbl> <dbl> <chr>  <dbl>
##  1 4924     994.  488. gustsu   308
##  2 4924     990.  489. virosu   358
##  3 4924     994.  498. quaras   318
##  4 4923     993.  469. protte   479
##  5 4923     982.  474. brosal  1290
##  6 4922     983   453. quaras   500
##  7 4921     992.  430. anacex  1660
##  8 4921     992.  432. tri2tu   398
##  9 4921     998.  434  poular   400
## 10 4921     999.  430. beilpe   602
## # … with 20,792 more rows


Notice that this trims down the data to about 21 thousand trees.



**Note: Make sure to apply this filter in your answer worksheet!**


Q1: What is a possible problem with the approach of using a single dbh cutoff for all species as indication that a tree has recruited past the sapling stage? What alternative may be used instead?


Visual inspection of the data

We can plot our data to examine the spatial distribution of our trees. In the plot below, each dot is a tree, colored by its species identity.

plot_bci = 
  bci |>
  ggplot(aes(gx, gy, group = sp, color = sp)) +
  theme(
    legend.position = 'none',
    aspect.ratio = 0.5
  )

plot_bci +
  geom_point(size = .5)

Let’s zoom in on a 200m \(\times\) 100m patch at the center of the forest plot for a more detailed visual:

plot_bci +
  geom_point() +
  coord_cartesian(xlim = c(400,600), ylim = c(200, 300))

The question is, are those trees randomly distributed on the plot, or is there a non-random pattern to their spatial distribution?


Testing for random spatial distribution

Let’s first ignore the species label, and treat all trees the same. We can then imagine each of the ~21k trees as a point on a plane, and we want to know whether the points are randomly scattered.

The simplest model of spatial randomness is to assume an initially empty plot to which we add N trees, one at a time, under the following rules:

These simple assembly rules correspond mathematically to drawing a set of N independent gx and gy coordinates, such that for each tree any value gx \(\in\) [0, 1000] and gy \(\in\) [0, 500] are are equally likely to be drawn.

As we have seen in lecture, if we then divide our space into many smaller sections (quadrats) and count the number of trees in each section, we expect those counts to follow a Poisson distribution.

The plot below zooms shows 20m \(\times\) 20m quadrats on our 200m \(\times\) 100m central patch:

plot_bci +
  geom_point() +
  theme(panel.grid.major = element_line(color = 'grey')) +
  scale_x_continuous(breaks = seq(400, 600, 20)) +
  scale_y_continuous(breaks = seq(200, 300, 20)) +
  coord_cartesian(xlim = c(400,600), ylim = c(200, 300))

tabulation = 
  bci |>
  group_by(quadrat) |>
  count() |>
  ungroup()

tabulation
## # A tibble: 1,250 × 2
##    quadrat     n
##    <chr>   <int>
##  1 0000       12
##  2 0001       19
##  3 0002       15
##  4 0003       16
##  5 0004       19
##  6 0005       21
##  7 0006       10
##  8 0007       17
##  9 0008       19
## 10 0009       18
## # … with 1,240 more rows


If indeed tree counts inside the quadrats \(n = 12, 19, 15, 16, ...\) are Poisson-distributed, then the probability that a quadrat contains k trees should be

\[P(n = k) = e^{-\lambda} \frac{\lambda ^ k}{k!}\],

where \(\lambda\) is the mean tree count across all quadrats.

We can easily find the value of \(\lambda\) by running

lambda = mean(tabulation$n)

lambda
## [1] 16.6416


Q2. Since the BCI plot is 1000 m \(\times\) 500 m and the quadrats are 20 m \(\times\) 20 m, this means there are 1,250 quadrats in the plot. Based on this and the mean number of trees per quadrat above, how many trees are there in the plot?


Q3: Recall that each row of the tibble bci lists one tree in the plot. So we can find out how many trees there are on the BCI plot directly – i.e. without the quadrat middleman – by simply typing nrow(bci), or equivalently, piping the verb nrow to the dataset bci. Show code that does the latter. Be careful not to assign the result to the object bci, as that would turn it into a number!


Q4. Recall that in a Poisson distribution, the variance equals the mean. Therefore if the tree counts in the quadrats are indeed sampled from a Poisson distribution, their variance should be similar to their mean. Calculate the variance of tabulation$n and check whether that is the case. Is it close to the mean?

A more rigorous test of whether the data are Poisson-distributed is to fit a Poisson distribution to it and check the goodness of fit. A Poisson distribution has only one free parameter, namely the mean, \(\lambda\). We have already calculated our best estimate for the parameter as the sample mean above, \(\lambda \simeq 16.6\). So now we can compare the empirical distribution of the tree counts against predictions from the fitted Poisson distribution.

data_hist = hist(tabulation$n, plot = FALSE, breaks = 15)

observed = 
  tibble(
    n = data_hist$mids,
    density = data_hist$density
  )

expected = 
  tibble(
    n = (min(tabulation$n) - 1):(max(tabulation$n) + 1),
    density = dpois(n, lambda = lambda)
  )

plot_histogram = 
  ggplot() +
  geom_col(data = observed, aes(n, density), color = 'black', fill = 'grey50') +
  geom_line(data = expected, aes(n, density), size = 1.5, color = rgb(153/255, 0, 0)) +
  theme(aspect.ratio = .5) +
  labs(x = 'Tree count', y = 'Frequency') +
  ggtitle('Histogram of tree counts across quadrats')
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
plot_histogram

This looks like a pretty good fit! The frequency of quadrats with different tree counts (grey bars) matches the Poisson expectations (red curve) pretty well.


Q5: Rerun the code above, this time changing the breaks argument in function hist on the first line to 35. This will remake the histogram using thinner bins. The Poisson fit still looks pretty decent, but a small bias in the Poisson predictions arguably becomes more visible: namely, the Poisson fit slightly overpredicts crowded quadrats and underpredicts sparse quadrats. Explain in your own words how the bias is visible from the plot.

A better way to visually compare observed data against fitted probability distributions is to look at the empirical cumulative distribution function (ecdf) in the data and compare it to the predicted cumulative distribution function, \(F(k) \equiv P(n \leq k)\), which reads: “\(F(k)\) is defined as the probability that a randomly selected species has up to \(k\) trees”. We can do that with the following code:

expected_cumulative = 
  tibble(
    n = (min(tabulation$n) - 1):(max(tabulation$n) + 1), 
    y = ppois(n, lambda = lambda)
  )

plot_cumulative = 
  ggplot() + 
  stat_ecdf(data = tabulation, aes(n), geom = 'point') +
  geom_line(data = expected_cumulative, aes(n, y), color = rgb(153/255, 0, 0)) +
  theme(aspect.ratio = .5) +
  labs(x = 'Tree count', y = 'Cumulative proportion')

plot_cumulative

In this plot, the y axis shows the proportion of quadrats with tree count up to the amount shown on the x axis.


Q6: Explain why the empirical distribution is always monotonic (i.e. non-decreasing).

Note: Unlike the histogram, the ecdf plot does not require binning the data. This is an advantage of the ecdf, as the shape of the histogram will depend on the number (or width) of bins that we choose to use. Another reason the ecdf is a better visual comparison is that if the data is sparse or the bins are too narrow, some bins may be empty, creating illusory “valleys” in the shape of the histogram.

Back to BCI, the Poisson fit looks very good! It seems that BCI trees in general are indeed distributed randomly across the 50-hectare plot. As we discussed, this is precisely what we would expect if those trees were randomly located around the plot with no influence from each other or external factors such as local environmental conditions (soil nutrient concentrations, drainage, topography, etc) or interactions between the trees.

Is this conclusive proof that recruitment of trees in BCI is a fully random process?


Case study: Gustavia superba

To answer that, let’s now look at an individual species rather than the full aggregate. We will focus our analysis on Gustavia superba, a small understory tree also known as the heaven lotus tree.

Figure: Gustavia superba, a small understory tree. Photo by the National Gardening Association. Flower, by The Garden of Eden

The code below shows the spatial distribution of heaven lotus trees by filtering our entire data set bci for only those trees whose species code is gustsu:

plot_gustsu = 
  bci |>
  filter(sp == 'gustsu') |>
  ggplot(aes(gx, gy, color = sp)) +
  theme(
    legend.position = 'none',
    aspect.ratio = 0.5
  ) +
  ggtitle('Gustavia superba')

plot_gustsu +
  geom_point()

The spatial distribution doesn’t look random.

We can now repeat the quadrat analysis above for this species. Let’s plot the 20 m \(\times\) 20 m quadrats:

plot_gustsu +
  geom_point(size = .5) +
  theme(panel.grid.major = element_line(color = 'grey')) +
  scale_x_continuous(breaks = seq(0, 1000, 20)) +
  scale_y_continuous(breaks = seq(0, 500, 20)) +
  theme(axis.text = element_blank())

We run into a problem: given the reduced tree count now that we are looking at a single species, the vast majority of quadrats are empty, and most of the others contain only 1 or 2 trees. In order for our quadrat analysis to make sense, we need to use bigger quadrats than 20 m \(\times\) 20 m. Let us instead use 100 m \(\times\) 100 m quadrats.

plot_gustsu +
  geom_point() +
  theme(panel.grid.major = element_line(color = 'grey')) +
  scale_x_continuous(breaks = seq(0, 1000, 100)) +
  scale_y_continuous(breaks = seq(0, 500, 100)) +
  theme(axis.text = element_blank())


Note: In general, when we try to fit data to a statistical distribution, we want to have a large number of data points and some variation across the data. In quadrat analysis, these considerations will weigh into our choice of quadrat size. The chosen size of 100 m \(\times\) 100 m strikes a good balance in our case.

The code below tabulates the counts of Gustavia superba across 100m \(\times\) 100m quadrats.

quadrat_finder = function(gx, gy, Lx, Ly){
  
  nx = 1000 / Lx
  ny = 500 / Ly
  
  qx = findInterval(gx, Lx * (0:nx))
  qy = findInterval(gy, Ly * (1:ny))
  
  quadrat = qx + nx * qy
  
}


tabulation_gustsu = 
  bci |>
  mutate(quadrat = quadrat_finder(gx, gy, Lx = 100, Ly = 100)) |>
  group_by(quadrat) |>
  summarize(n = sum(sp == 'gustsu'), .groups = 'drop')

tabulation_gustsu
## # A tibble: 50 × 2
##    quadrat     n
##      <dbl> <int>
##  1       1     8
##  2       2     2
##  3       3     3
##  4       4    10
##  5       5     5
##  6       6     7
##  7       7     3
##  8       8     1
##  9       9     5
## 10      10    11
## # … with 40 more rows


Q7. a) What is the mean count of heaven lotus trees across all quadrats? b) Assign the result from (a) to an object named lambda_gustsu. Show your code. c) Using the full data set, calculate the percentage of the trees in the BCI plot that are heaven lotus trees.


Q8. Based on the mean-vs-variance test, does it seem that the spatial distribution of Gustavia superba on BCI is well described by a Poisson process?


Q9. In the birds-on-a-line example from lecture, we found that the variance in bird counts across segments was much lower than the mean, and concluded that the birds were evenly spaced. How do your results for the heaven lotus tree compare, and what do you conclude? Does your conclusion match what you would guesstimate by visual inspection of the Gustavia superba plot above?


The code below plots the cumulative empirical distribution function for G. superba.

expected_cumulative_gustsu = 
  tibble(
    n = (min(tabulation_gustsu$n) - 1):(max(tabulation_gustsu$n) + 1), 
    y = ppois(n, lambda = lambda_gustsu)
  )

plot_cumulative_gustsu = 
  ggplot() + 
  stat_ecdf(data = tabulation_gustsu, aes(n), geom = 'point') +
  geom_line(data = expected_cumulative_gustsu, aes(n, y), color = rgb(153/255, 0, 0)) +
  theme(aspect.ratio = .5) +
  labs(x = 'Tree count', y = 'Cumulative proportion') +
  ggtitle('Gustavia superba')

plot_cumulative_gustsu

The plot reveals two clear outliers: two quadrats with an enormous number of G. superba trees compared to all other quadrats. (Looking back at the spatial plot, it is not difficult to find those quadrats.)

Sometimes, researchers wish to examine the behavior of the bulk of the data, which they do by setting outliers aside (which is not to say outliers can be ignored!) In our case, one could hypothesize that these outlier quadrats are driving the statistical inference. Perhaps if we remove them from the analysis, the other quadrats are a good Poisson fit?

The code below removes those two heavily populated quadrats, and shows a histogram of the tree counts inside the other quadrats, overlaid with the Poisson fit.

filtered_gustsu = 
  tabulation_gustsu |>
  filter(n < 50)

lambda_filtered_gustsu = mean(filtered_gustsu$n)

data_hist = hist(filtered_gustsu$n, plot = FALSE, breaks = 15)

observed = 
  tibble(
    n = data_hist$mids,
    density = data_hist$density
  )

expected = 
  tibble(
    n = (min(filtered_gustsu$n) - 1):(max(filtered_gustsu$n) + 1),
    density = dpois(n, lambda = lambda_filtered_gustsu)
  )

plot_histogram = 
  ggplot() +
  geom_col(data = observed, aes(n, density), color = 'black', fill = 'grey50') +
  geom_line(data = expected, aes(n, density), size = 1.5, color = rgb(153/255, 0, 0)) +
  theme(aspect.ratio = .5) +
  labs(x = 'Tree count', y = 'Frequency') +
  ggtitle('Histogram of tree counts across quadrats')

plot_histogram


Compared to the Poisson curve, we see that both very low- and very high-abundance quadrats are more common than expected, and intermediate-abundance quadrats are rarer than expected. (In fact, the data seems bimodal–i.e. there are two local peaks rather than a single peak as in the Poisson curve.)


Q10. How do you interpret this in terms of whether the trees are spatially clumped or evenly dispersed?


Q11. Do you conclude that the outlier quadrats were or were not driving our statistical inference?



Spatial descriptive statistics - Ripley’s K

A different method of spatial analysis which does not require using quadrats is based on examining the neighborhood of each data point (each tree in our case) at different scales, and look for departures from the Poisson process.

Suppose we drop a pin on a random tree on BCI, and draw a circle of radius \(r\) around it. In a Poisson process, the expected number of trees to be found inside that circle is proportional to the circle’s area, \(\pi r^2\). If we observe that trees on BCI tend to have many more neighbors than this null expectation, that would suggest spatial clustering at scale \(r\). Conversely, finding many fewer neighbors than expected suggests spatial overdispersion at that scale. As we sweep across different scales, we can look for scales where BCI departs from the Poisson process.

This is the idea behind Ripley’s K analysis. For a given data set, we can empirically calculate Ripley’s K function \(K_{Emp}(r)\) by tallying the number of neighbors of a focal tree inside a circle of radius \(r\) centered on that tree, then averaging across all focal trees. For a Poisson process, the theoretical value of Ripley’s K is known, namely \(K_{Pois}(r) = N/A \pi r^2\), where \(N\) is the number of data points and \(A\) is the area of the plot. So for our analysis, we can plot the ratio \(K_{Emp} / K_{Pois}\) and look for departures from the expected ratio of 1.

The code below writes functions to perform our Ripley’s K analysis (you don’t need to udnerstand the code).

BCI_ppp = function(species){
  if(species == 'all'){
    dtf = bci
  } else{
    dtf = 
      bci |>
      filter(sp == species)
  }
  
  dtf |> 
  select(gx, gy) |> 
  unique() |>
  as.ppp(W = list(xrange = c(0, 1000), yrange = c(0, 500)))
}

K_tibble = function(species){
  x = BCI_ppp(species)
  Kinhom(x, correction = 'isotropic') |>
  as_tibble() |>
  mutate(sp = species)
}


All species

Now we call our function K_tibble() passing all as the species argument in order to perform the analysis for all BCI species in aggregate. We then plot the result.

data_allspp = K_tibble(species = 'all') 

plot_K = 
  data_allspp |>
  filter(r > .75) |>
  ggplot() +
  geom_hline(yintercept = 1, color = rgb(153/255, 0, 0)) +
  geom_line(aes(r, iso/ pi / r^2), size = 1) +
  theme(aspect.ratio = 1) +
  ylab('Observed K  /  Expected K') +
  xlab('circle radius')


plot_K +
  scale_x_log10()


As we can see, the empirical Ripley’s K is very close to the null expectation at distances larger than 10 m from a tree. This suggests that indeed at distance scales beyond 10 meters, BCI is indistinguishable from a Poisson process, meaning that the spatial distribution of trees is indistinguishable from random at those scales.


Q12. Is this result consistent with the mean / variance test we performed above when gridding the plot into 20m \(\times\) 20m quadrats? Explain your answer.


Below this distance threshold, however, we have indication of overdispersion, i.e. the immediate neighborhood of existing trees is sparse. In other words, BCI trees are surrounded by fewer close neighbors than expected by chance.


Q13. Thinking about light and nutrients, provide an ecological hypothesis for tree-tree interactions that is consistent with this result.


Whatever the reason, these results suggest that tree-tree interactions on BCI are likely contained within 10 meters of each other.


Gustavia superba

Let’s rerun the Ripley’s K analysis, this time focusing on the heaven lotus tree.

data_gustsu = K_tibble(species = 'gustsu') 

plot_K = 
  data_gustsu |>
  filter(r > .75) |>
  ggplot() +
  geom_hline(yintercept = 1, color = rgb(153/255, 0, 0)) +
  geom_line(aes(r, iso/ pi / r^2), size = 1) +
  theme(aspect.ratio = 1) +
  ylab('Observed K  /  Expected K') +
  xlab('circle radius')


plot_K +
  scale_x_log10()

Now we get the opposite result!

At very short distances (\(r < 2.5\) m) the empirical Ripley’s K is much higher than expected, suggesting strong clumping at these scales. This likely reflects the large clump of trees observed between gx = 600-750 m and gy = 400-500 m. However, we see an overabundance of heaven lotus trees neighboring each other even beyond 2.5 m apart, with the clumping effect decreasing at larger distances.


This is consistent with our previous quadrat-based results for G. superba, except now we are getting a more detailed distance-specific description of the spatial distribution of this species.


Figure: Fruit and seed of the heaven lotus tree. Photos by Spencer Woodard (fruit) and Roger Culos (seed).


Q14: Provide a biological hypothesis for the clumping of G. superba.


Q15. Repeat the analysis above for another species, Trichilia tuberculata (species code “tri2tu”), a large, common canopy tree on BCI. Comparing results for this species with those of Gustavia superba and the analysis for all species, what differences and similarities do you see?


Figure: Flower, leaves, and fruit of Trichilia truberculata. Photos by Riley Fortier (flower), Andrés Hernández, STRI.


While our empirical Ripley’s K departs from expectations at various scales, we need a more sophisticated analysis in order to be confident that these results could not come about by chance. This is because even a true Poisson process may deviate from expectations by chance, just like a fair coin may not turn up exactly 5 heads and 5 tails out of 10 tosses. If we got 6 heads and 4 tails, how confident should we be that the coin is biased towards heads? What if we got 7 heads and 3 tails?

In order to verify that we are seeing convincing departures from random expectations, we must assess how often the random process could generate departures at least as large as the ones we saw. One way to do that is by simulating the random process a large number of times, and then check what proportion of those simulations show even greater departures from expectations than our data. If that percentage is very small, then we can confidently reject the possibility that the apparent non-randomness in our data is a fluke — i.e. we can then confidently reject the null model. In that case, we say the result is statistically significant. This is the idea behind p-value tests, a staple of statistical inference.

The code below performs 1,000 simulations of a Poisson process (our null model). Based on these simulations, it determines, for each distance, what interval of Ripley’s K values above and below the expectations could have arisen by chance. It then plots that interval as a pink ribbon. Values outside the ribbon have a less than 5% chance of arising from the null model (i.e. by chance), and are thus deemed statistically significant.

Note: The last sentence above is only approximately right. It would be exactly true if we used infinitely many replicates of the random process to obtain the ribbon. Of course we can’t do that, so the plotted interval obtained with a finite sample of random processes is an estimate of the theoretical interval that would arise from an infinite set.

focal_species_name = 'Gustavia superba'
focal_species_code = 'gustsu'

focal_pp = BCI_ppp(focal_species_code)

null_tibble = 
  envelope(
    focal_pp, 
    Kinhom, 
    nsim = 999, 
    nrank = 25, 
    verbose = FALSE
  )

plot_RipleysK_with_quantiles =
  ggplot() +
  geom_hline(yintercept = 1, color = rgb(153 / 255, 0, 0)) +
  geom_ribbon(
    data = 
      null_tibble |> 
        filter(r > 0.75),
    aes(x = r, ymin = lo / pi / r^2, ymax = hi / pi / r^2),
    fill = 'red',
    alpha = 0.3
  ) +
  geom_line(
    data = 
      null_tibble |>
      filter(r > .75), 
    aes(r, obs / pi / r^2),
    linewidth = 1
  ) +
  ylab('Observed K  /  Expected K') +
  xlab('circle radius') +
  theme(aspect.ratio = 1) +
  ggtitle(focal_species_name) +
  scale_x_log10()

plot_RipleysK_with_quantiles +
  geom_vline(
    xintercept = c(1.25, 5), 
    color = c('darkgreen', 'blue'), 
    linewidth = 2
  )

We can see that the clustering within tree-tree distances \(r < 1.25\) m (green line) is statistically significant, far exceeding the upper bound of our null expectations. Furthermore, the clustering at distances \(>\) 5 m (blue line) is also significant. But now we see that the spatial distribution of Gustavia superba trees at ranges between \(r =\) 1.25 m and 5 m is consistent with the null model — in other words, we cannot reject with greater than 95% confidence the possibility that it could have arisen by chance.


Q16. Repeat the analysis above for Trichilia tuberculata by tweaking the code above in appropriate places (Also, delete the green and blue lines by deleting the geom_vline() verb at the bottom of the code, along with the plus sign at the end of the line above it). Does any of your conclusions in Q15 need revising?


Optional bonus point: If you wish, repeat the analysis for all species on BCI. This will take a lot longer to run, because the numerical simulations of the random process get slower for the much higher number of points required, which is now number_of_points = nrow(bci). You will also need to set focal_species_name = "All species" and focal_species_code = "all". Do not attempt this before finishing the rest of the lab because the code takes a long time to run!.



Spatial associations between species

Figure: Is there a spatial association between different species on BCI? Photo by Christian Ziegler.


So far we have been examining whether all species as an aggregate or each species individually show signs of departure from random placement in the forest plot. We have seen that as a whole the forest is not very different from a random assemblage of points in a plane, but species considered individually show a clearly nonrandom spatial pattern, reflecting underlying ecological processes. But can we go further and ask about nonrandom associations between species?

Our Ripley’s K analysis revealed that conspeficic trees (i.e. trees belonging to the same species) tend to recruit near each other. How about heterospecific trees? Can we find groups of BCI species that recruit near each other more often than expected by chance?

In the following, I will present some of my own research on spatial patterns among BCI trees. The question here is whether we can use the BCI censuses to find evidence of ecological similarities between those species. For example, tropical tree species differ with regard to affinity for local environmental conditions such as soil nutrient profiles, soil drainage, pH, topography, etc, and we want to know whether (a) each species has their own unique strategy or (b) species can be grouped into “ecological niches” characterized by similar preferences for local environments. If the latter is the case, and given that local environments on BCI are on a spatial gradient (figure below), then we may expect species with similar strategies (i.e. species occupying the same environmental niche) to occur near each other.


Figure: Countour map of BCI soil conditions (nutrient concentration) shows clear spatial gradients. From John et al. PNAS 2007


Consider two species, A and B. Measure the distances between every pair of trees of species A and B, and calculate what fraction of these pairs are within a small distance \(d^*\) of each other. By comparing that fraction to the expectation from a Poisson process, we can determine whether species A and B tend to occur near each other more than expected by chance.

Now imagine BCI as a network, with each species being a node (figure below). For each species, draw a connecting line to all other species found to occur in close proximity to it more often than expected by chance. For BCI, here is what that network looks like:

Notice that every species is connected to some but not all other species. We now ask whether we can find groups of species that are highly connected to each other but not to other species. (In social networks, the equivalent would be a group of people that tend to be friends with each other and not with other people.)

To find out, I used a so-called community finding algorithm from the R package igraph, whose details need not concern us. Results are as follows:

This is the same graph as above, except color-coded by groups. As you can see, the algorithm found three distinctive groups in the BCI plot. The links between same-group species are shown in black, and links across groups are shown in red.


Q17. What does the result above reflect in terms of the spatial distribution of species on BCI? How might we interpret this result in terms of environmental niches?


Note: You may have noticed the network above does not include all the BCI species. This is because I am restricting the analysis to species with at least 200 trees in the BCI plot. I do this not just for visual clarity of the network above, but also because it is difficult to determine whether rare species tend to occur near each other more than expected by chance because we don’t have enough pairs of those species to make reliable statistical estimates.

Note: In addition to the abundance threshold, there is another free parameter to this analysis: the neighborhood scale \(d^*\) that we check for departures from randomness. I am using \(d^* = 10\) m, reflecting our finding above that interactions between trees tend to occur within this scale (recall that beyond this scale, the spatial distribution of trees is indistinguishable from a Poisson process). If we use different thresholds within a reasonable range (say, between 10 and 20 m), we get a slightly different set of edges, but not different enough to qualitatively change the results.


Let’s look at the spatial map of our trees again, this time coloring them based on community membership:

Notice regions of the plot dominated by a single color. This is another way to visually confirm that species in the same group tend to occur near each other.

Notice that some species pairs commonly found near each other (i.e. connected by a line) were classified into distinct groups, as reflected by the presence of the cross-links in red. If we count the number of within-group links (black lines) and compare it to the total number of links (black lines and red lines), the ratio of these two numbers will give us a measure of the cohesion of the groups. This index is called the modularity of the network. The higher the modularity, the more tight and well-separated the groups.

You may be asking yourself, even random networks may contain densely connected groups by chance. How do we know our result is statistically significant?

Good question! As we have seen, one way to address it is with a p-value test, where we estimate the probability that random data could produce results as distinct from null expectations as our own data. To do this we compare the modularity of our network to the modularity of random networks.


Figure: Example of a high-modularity network on the left (A), and a randomized low-modularity network on the right (B). Notice that there are only a few links between nodes of different colors in A but many in B. Image from Wang & Zhang 2007.


Q18. The modularity of a network can range from -0.5 to 1. Our network has a modularity of 0.28. By comparison, a set of 1,000 randomized networks with the same number of species and links had a mean modularity of 0.1, and 95% of them were below 0.11. All were below 0.28. What can we conclude about the statistical significance of our groups?


As an informal way to verify that the results above are not a fluke, let’s generate and plot a random network with the same number of species and links as our network, and use the same algorithm to find groups in the network.


The plotting algorithm plots highly-connected nodes close to each other and weakly-connected nodes far apart. As a result, the spatial appearance of the groups reflects their modularity.


Q19. Between BCI and the random network above, in which one do the groups look tighter and better separated from each other? In other words, which network (BCI or the random one) seems to have higher modularity?


Note: Another issue to consider is whether the result is robust. If we change the parameters of the analysis (the abundance threshold and the distance threshold \(d^*\)), do we get qualitatively different results? In particular, do we get a different number of communities, and/or does the species composition of the communities change? Those are also good questions which we will not get into here, but which you must always keep in mind when performing statistical analyses.